Skip to content

Add static to fmt string to get it put into rodata#1422

Merged
christos68k merged 1 commit into
open-telemetry:mainfrom
gnurizen:print-fmt-rodata
May 15, 2026
Merged

Add static to fmt string to get it put into rodata#1422
christos68k merged 1 commit into
open-telemetry:mainfrom
gnurizen:print-fmt-rodata

Conversation

@gnurizen
Copy link
Copy Markdown
Contributor

Clang will only put DEBUG_PRINT fmt strings into rodata if they
are "static const char[]", we were missing the static. So instead
clang generates code to construct the fmt string on the stack which
is wasteful of instructions and stack space.

Before:

;     DEBUG_PRINT("beam: no PerCPURecord found");
      10:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      12:       61 11 00 00 00 00 00 00 w1 = *(u32 *)(r1 + 0x0)
;     DEBUG_PRINT("beam: no PerCPURecord found");
      13:       15 01 9d 03 00 00 00 00 if r1 == 0x0 goto +0x39d <LBB0_107>
      14:       b7 01 00 00 75 6e 64 00 r1 = 0x646e75
;     DEBUG_PRINT("beam: no PerCPURecord found");
      15:       63 1a 78 ff 00 00 00 00 *(u32 *)(r10 - 0x88) = w1
      16:       18 01 00 00 65 63 6f 72 00 00 00 00 64 20 66 6f r1 = 0x6f662064726f6365 ll
      18:       7b 1a 70 ff 00 00 00 00 *(u64 *)(r10 - 0x90) = r1
      19:       18 01 00 00 20 50 65 72 00 00 00 00 43 50 55 52 r1 = 0x5255504372655020 ll
      21:       7b 1a 68 ff 00 00 00 00 *(u64 *)(r10 - 0x98) = r1
      22:       18 01 00 00 62 65 61 6d 00 00 00 00 3a 20 6e 6f r1 = 0x6f6e203a6d616562 ll
      24:       7b 1a 60 ff 00 00 00 00 *(u64 *)(r10 - 0xa0) = r1
      25:       bf a1 00 00 00 00 00 00 r1 = r10
      26:       07 01 00 00 60 ff ff ff r1 += -0xa0
      27:       b7 02 00 00 1c 00 00 00 r2 = 0x1c
      28:       85 00 00 00 06 00 00 00 call 0x6
      29:       05 00 8d 03 00 00 00 00 goto +0x38d <LBB0_107>

After:

;     DEBUG_PRINT("beam: no PerCPURecord found");
      10:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      12:       61 11 00 00 00 00 00 00 w1 = *(u32 *)(r1 + 0x0)
;     DEBUG_PRINT("beam: no PerCPURecord found");
      13:       15 01 11 02 00 00 00 00 if r1 == 0x0 goto +0x211 <LBB0_100>
      14:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      16:       b7 02 00 00 1c 00 00 00 r2 = 0x1c
      17:       85 00 00 00 06 00 00 00 call 0x6
      18:       05 00 0c 02 00 00 00 00 goto +0x20c <LBB0_100>

Clang will only put DEBUG_PRINT fmt strings into rodata if they
are "static const char[]", we were missing the static. So instead
clang generates code to construct the fmt string on the stack which
is wasteful of instructions and stack space.

Before:

```
;     DEBUG_PRINT("beam: no PerCPURecord found");
      10:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      12:       61 11 00 00 00 00 00 00 w1 = *(u32 *)(r1 + 0x0)
;     DEBUG_PRINT("beam: no PerCPURecord found");
      13:       15 01 9d 03 00 00 00 00 if r1 == 0x0 goto +0x39d <LBB0_107>
      14:       b7 01 00 00 75 6e 64 00 r1 = 0x646e75
;     DEBUG_PRINT("beam: no PerCPURecord found");
      15:       63 1a 78 ff 00 00 00 00 *(u32 *)(r10 - 0x88) = w1
      16:       18 01 00 00 65 63 6f 72 00 00 00 00 64 20 66 6f r1 = 0x6f662064726f6365 ll
      18:       7b 1a 70 ff 00 00 00 00 *(u64 *)(r10 - 0x90) = r1
      19:       18 01 00 00 20 50 65 72 00 00 00 00 43 50 55 52 r1 = 0x5255504372655020 ll
      21:       7b 1a 68 ff 00 00 00 00 *(u64 *)(r10 - 0x98) = r1
      22:       18 01 00 00 62 65 61 6d 00 00 00 00 3a 20 6e 6f r1 = 0x6f6e203a6d616562 ll
      24:       7b 1a 60 ff 00 00 00 00 *(u64 *)(r10 - 0xa0) = r1
      25:       bf a1 00 00 00 00 00 00 r1 = r10
      26:       07 01 00 00 60 ff ff ff r1 += -0xa0
      27:       b7 02 00 00 1c 00 00 00 r2 = 0x1c
      28:       85 00 00 00 06 00 00 00 call 0x6
      29:       05 00 8d 03 00 00 00 00 goto +0x38d <LBB0_107>
```

After:

```
;     DEBUG_PRINT("beam: no PerCPURecord found");
      10:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      12:       61 11 00 00 00 00 00 00 w1 = *(u32 *)(r1 + 0x0)
;     DEBUG_PRINT("beam: no PerCPURecord found");
      13:       15 01 11 02 00 00 00 00 if r1 == 0x0 goto +0x211 <LBB0_100>
      14:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
      16:       b7 02 00 00 1c 00 00 00 r2 = 0x1c
      17:       85 00 00 00 06 00 00 00 call 0x6
      18:       05 00 0c 02 00 00 00 00 goto +0x20c <LBB0_100>
```
@gnurizen
Copy link
Copy Markdown
Contributor Author

gnurizen commented May 15, 2026

Not sure if there was a reason why we were avoiding rodata for fmt strings, un-happy accident I suppose.

Instruction count: before vs. after static const char[] fmt strings

  ┌─────────────────────────────────────┬────────┬───────┬──────┬────────┐
  │               Program               │ Before │ After │  Δ   │   Δ%   │
  ├─────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ kprobe/unwind_beam                  │    987 │   576 │ −411 │ −41.6% │
  ├─────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ perf_event/unwind_beam              │    987 │   576 │ −411 │ −41.6% │
  ├─────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ kprobe/unwind_dotnet                   │   1642 │   931 │ −711 │ −43.3% │
  ├────────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ kprobe/unwind_dotnet10                 │   1538 │   904 │ −634 │ −41.2% │
  ├────────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ perf_event/unwind_dotnet               │   1642 │   931 │ −711 │ −43.3% │
  ├────────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ perf_event/unwind_dotnet10             │   1538 │   904 │ −634 │ −41.2% │
  ├────────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ kprobe/generic                         │   1142 │   725 │ −417 │ −36.5% │
  ├────────────────────────────────────────┼────────┼───────┼──────┼────────┤
  │ kprobe/go_labels                       │    665 │   371 │   −294 │ −44.2% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/go_labels                   │    665 │   371 │   −294 │ −44.2% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_hotspot                  │   2009 │  1225 │   −784 │ −39.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_hotspot              │   2009 │  1225 │   −784 │ −39.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ tracepoint/integration/sched_switch    │    140 │   128 │    −12 │  −8.6% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_stop                     │   1648 │   988 │   −660 │ −40.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_stop                 │   1648 │   988 │   −660 │ −40.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_native                   │   1525 │   979 │   −546 │ −35.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/native_tracer_entry         │   1235 │   781 │   −454 │ −36.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_native               │   1525 │   979 │   −546 │ −35.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/finish_task_switch              │   1182 │   752 │   −430 │ −36.4% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/dummy                           │      6 │     6 │      0 │   0.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ tracepoint/sched/sched_switch          │     53 │    36 │    −17 │ −32.1% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_perl                     │   1613 │   975 │   −638 │ −39.6% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_perl                 │   1613 │   975 │   −638 │ −39.6% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_php                      │    957 │   547 │   −410 │ −42.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_php                  │    957 │   547 │   −410 │ −42.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_python                   │   1525 │   961 │   −564 │ −37.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_python               │   1525 │   961 │   −564 │ −37.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_ruby                     │   1575 │  1043 │   −532 │ −33.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_ruby                 │   1575 │  1043 │   −532 │ −33.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ tracepoint/sched/sched_process_free/v2 │    278 │   170 │   −108 │ −38.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ tracepoint/sched/sched_process_free/v1 │    278 │   170 │   −108 │ −38.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ tracepoint/syscalls/sys_enter_bpf      │     52 │    39 │    −13 │ −25.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ raw_tracepoint/sys_enter               │     62 │    44 │    −18 │ −29.0% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ kprobe/unwind_v8                       │   1657 │  1030 │   −627 │ −37.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ perf_event/unwind_v8                   │   1657 │  1030 │   −627 │ −37.8% │
  ├────────────────────────────────────────┼────────┼───────┼────────┼────────┤
  │ Total                                  │  39110 │ 23911 │ −15199 │ −38.9% │
  └────────────────────────────────────────┴────────┴───────┴────────┴────────┘

@gnurizen gnurizen marked this pull request as ready for review May 15, 2026 11:18
@gnurizen gnurizen requested review from a team as code owners May 15, 2026 11:18
Copy link
Copy Markdown
Member

@florianl florianl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! 👍

@christos68k christos68k merged commit 7796e25 into open-telemetry:main May 15, 2026
33 checks passed
gnurizen added a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request May 26, 2026
gnurizen added a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request May 26, 2026
…imit

After kernel-bump open-telemetry#1310 removed the trailing "\n" from the printt
____fmt[] declaration, the recompiled BPF program cuda_activity_batch_tail
exceeds the 6.16 verifier's path-tracking complexity budget and is
rejected with "argument list too long". The cuda.ebpf.c source did not
change, but the compiler's inlining/path-analysis decisions shifted.

Making ____fmt[] static puts the per-call-site format strings in .rodata
instead of materializing them on the BPF stack at every callsite, which
substantially reduces per-program complexity (instruction counts: amd64
62826 -> 44954, arm64 71875 -> 53960).

Source change ported from upstream PR open-telemetry#1422 (commit 7796e25). Rebuilt
binaries locally to match the current source tree.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants